A Novel Feature Combination Approach for Spoken Document Classification with Support Vector Machines
Abstract
Most approaches to text classification represent a document with one type of basic unit (term): words (with or without stemming), character n-grams, phonemes, syllables, multi-words, etc. These units are always used in isolation, one type at a time. In this paper, we present a novel approach that combines two types of such units to represent a document for text classification. Our experiments show that, if appropriately chosen, the combined terms yield better recognition rates than any one of those term types alone. The new approach is closely related to the property of high redundancy, which is desirable for text classification tasks and is directly connected to the margin theory of Support Vector Machines [14]. The topic classification approach presented here is designed to be part of the ALERT system, which automatically scans multimedia data such as TV or radio broadcasts for the presence of pre-specified topics. Tests are therefore conducted mainly on the (German) transcription output of an automatic speech recognizer (ASR). Because, for German document classification, character n-gram features are more robust against errors in the text (e.g. from ASR) and yield better results than word-level features [7], we combined these two types of terms to represent a document. In most cases, this approach gives better results than character n-gram features or word features alone. We also applied our feature combination approach to text classification on the well-known English Reuters corpus (which is based on plain text, not ASR output). There, too, an appropriate combination of different types of features gives slightly better results. Additionally, we introduce a soundex text representation scheme which, when used in combination with other feature types, can help the text classification task.
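The feature combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `w:`/`c:`/`s:` prefixes, the trigram default, and the whitespace tokenization are all assumptions made for illustration, and the `soundex` function is the standard American Soundex code, which may differ from the soundex scheme the paper introduces (it also ignores non-ASCII letters such as German umlauts). The sketch only builds a combined bag of terms; weighting the terms and training an SVM on them are left out.

```python
from collections import Counter

# Standard American Soundex digit groups (an assumption; the paper's
# soundex representation scheme may be defined differently).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def word_terms(text):
    """Lowercased whitespace-delimited words, prefixed with 'w:'."""
    return ["w:" + w for w in text.lower().split()]


def char_ngram_terms(text, n=3):
    """Overlapping character n-grams (spaces kept), prefixed with 'c:'."""
    s = " ".join(text.lower().split())  # normalize runs of whitespace
    return ["c:" + s[i:i + n] for i in range(len(s) - n + 1)]


def soundex(word):
    """Four-character American Soundex code; non a-z characters are dropped."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate equal codes
            prev = digit
    return (code + "000")[:4]


def combined_features(text, n=3, use_soundex=False):
    """Bag of terms mixing word and character n-gram features.

    The prefixes keep the feature types in separate parts of the
    vocabulary, so a word and an identical n-gram are counted apart.
    """
    terms = word_terms(text) + char_ngram_terms(text, n)
    if use_soundex:
        terms += ["s:" + soundex(w) for w in text.split() if soundex(w)]
    return Counter(terms)


if __name__ == "__main__":
    print(combined_features("the cat", use_soundex=True))
```

The resulting term counts could then be weighted (for example with tf-idf) and passed to any SVM implementation. Prefixing each term type is one simple way to let a single feature vector carry several representations of the same text.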
Similar resources
A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
With the rapid increase in the number of documents, Text Document Classification (TDC) methods have become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and a Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the large feature space in TDC. TDC includes different actions such as text processing, feature extraction, form...
A QUADRATIC MARGIN-BASED MODEL FOR WEIGHTING FUZZY CLASSIFICATION RULES INSPIRED BY SUPPORT VECTOR MACHINES
Tuning the weights of the rules in Fuzzy Rule-Based Classification Systems has recently been studied as a way to improve classification accuracy. In this paper, a margin-based optimization model, inspired by Support Vector Machine classifiers, is proposed to compute these fuzzy rule weights. This approach not only considers both accuracy and generalization criteria in a single objective fu...
Face Recognition using Eigenfaces, PCA and Support Vector Machines
This paper is based on a combination of principal component analysis (PCA), eigenfaces and support vector machines. Using the N-fold method and with respect to the value of N, each person's face images are divided into two sections. As a result, vectors of training features and test features are obtained. Classification precision and accuracy were examined with three different types of kernel and...
Spoken language classification using hybrid classifier combination
In this paper we describe an approach for spoken language analysis for helpdesk call routing using a combination of simple recurrent networks and support vector machines. In particular we examine this approach for its potential in a difficult spoken language classification task based on recorded operator assistance telephone utterances. We explore simple recurrent networks and support vector ma...
Automatic Interpretation of UltraCam Imagery by Combination of Support Vector Machine and Knowledge-based Systems
With the development of digital sensors, an increasing number of high-resolution images are available. Interpretation of these images is not possible manually, which necessitates seeking for practical, fast and automatic solutions to solve the environmental and location-based management problems. The land cover classification using high-resolution imagery is a difficult process because of the c...